November 15, 2020

Single Precision Numbers

Storing numbers with a fractional part (a decimal point) means encoding them in binary without taking up too much space.

On 32 bits, we divide the bits into 1 sign bit s, 8 exponent bits e, and the remaining 23 bits for the fractional part.

The formula for decoding a 32-bit floating point number is as follows:

$$n_{10} = (-1)^{s} \times 2^{e} \times \left( 1 + \sum_{i=1}^{23} b_{23-i} \times 2^{-i} \right)$$

where $n_{10}$ is the resulting decimal number, $s$ is the sign bit (most significant bit), $e$ is the decimal value encoded by the 8 exponent bits, and $b_i$ is bit number $i$ of the 23-bit fraction.
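The formula can be checked with a short Python sketch. The function name `decode_float32` is hypothetical, and the sketch assumes a normalized number, so it rejects the exponent fields that are all zeros or all ones. It also assumes the exponent is stored with an offset of 127, as described in the next sections.

```python
import struct


def decode_float32(bits: int) -> float:
    """Decode the 32-bit pattern of a normalized single-precision number."""
    s = (bits >> 31) & 0x1            # sign bit (bit 31)
    stored_exp = (bits >> 23) & 0xFF  # 8 exponent bits (bits 30 to 23)
    fraction = bits & 0x7FFFFF        # 23 fraction bits (bits 22 to 0)
    if stored_exp in (0, 255):
        # Assumption of this sketch: special encodings are out of scope.
        raise ValueError("zero, subnormal, infinity and NaN are not handled")
    e = stored_exp - 127              # remove the offset of 127
    mantissa = 1.0                    # the implicit leading 1
    for i in range(1, 24):
        b = (fraction >> (23 - i)) & 1  # bit b_(23-i)
        mantissa += b * 2.0 ** (-i)
    return (-1) ** s * 2.0 ** e * mantissa


# Compare against the standard library's own decoding of the same pattern.
pattern = 0x40490FDB  # single-precision approximation of pi
reference = struct.unpack(">f", struct.pack(">I", pattern))[0]
print(decode_float32(pattern) == reference)
```

The comparison with `struct` only confirms that the hand-written formula and the platform's decoding agree for this pattern.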

Sign bit

The most significant bit (bit 31) is the sign bit. 0 means we encoded a positive number, and 1 is negative.

Exponent encoding

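The exponent e is not encoded using the two's complement representation, but with a different one: the offset-binary representation with an offset of 127, so the exponent is the stored 8-bit value minus 127. This means that $0000\,0001_2$ represents $-126$, $0111\,1111_2$ represents $0$ and $1111\,1110_2$ represents $127$. The two remaining patterns are reserved: $0000\,0000_2$ encodes zero and subnormal numbers, and $1111\,1111_2$ encodes infinity and NaN.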
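As a small illustration of the offset encoding, here is a sketch with two hypothetical helpers, `decode_exponent` and `encode_exponent`. It assumes the IEEE 754 offset of 127 and treats the all-zeros and all-ones fields as reserved.

```python
OFFSET = 127  # offset (bias) of the single-precision exponent


def decode_exponent(field: int) -> int:
    """Return the exponent encoded by an 8-bit exponent field."""
    if not 1 <= field <= 254:
        # 0 and 255 are reserved for zero/subnormals and infinity/NaN.
        raise ValueError("reserved or out-of-range exponent field")
    return field - OFFSET


def encode_exponent(e: int) -> int:
    """Return the 8-bit field that stores the exponent e."""
    if not -126 <= e <= 127:
        raise ValueError("exponent outside the normalized range")
    return e + OFFSET


for field in (0b00000001, 0b01111111, 0b10000000, 0b11111110):
    print(f"{field:08b} -> {decode_exponent(field)}")
```

The loop prints the exponents for the smallest normal field, the field for exponent 0, the field for exponent 1 and the largest normal field.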

Fraction encoding

The fractional part of the number is encoded with standard binary encoding. There is a simple method to convert a decimal fractional part into binary:

* multiply the number by 2
* take the integer part (either 0 or 1), which will be the binary bit number -1 (bit number 22 in our 32-bit floating-point encoding)
* multiply the fractional part of the number obtained by 2
* repeat for bit numbers -2 to -23 (bits 21 to 0 in the 32-bit floating-point encoding)

For example, for 0.345:

Multiply by 2	Integer part	Fraction part	Bit number in 32-bit representation
0.345 * 2 = 0.690	0	0.690	22
0.690 * 2 = 1.380	1	0.380	21
0.380 * 2 = 0.760	0	0.760	20
0.760 * 2 = 1.520	1	0.520	19
0.520 * 2 = 1.040	1	0.040	18
0.040 * 2 = 0.080	0	0.080	17
..	..	..	..
0.880 * 2 = 1.760	1	0.760	0
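The table can be reproduced with a few lines of Python. This is a minimal sketch: the function name `fraction_to_bits` is hypothetical, and it uses `fractions.Fraction` so that the decimal input is handled exactly instead of going through a binary float first. Like the table, it truncates after 23 bits and does not round.

```python
from fractions import Fraction


def fraction_to_bits(value: str, n_bits: int = 23) -> str:
    """Convert a decimal fraction in [0, 1) to its first n_bits binary digits."""
    frac = Fraction(value)
    if not 0 <= frac < 1:
        raise ValueError("value must be in the range [0, 1)")
    bits = []
    for _ in range(n_bits):
        frac *= 2            # multiply by two
        bit = int(frac)      # integer part: 0 or 1
        bits.append(str(bit))
        frac -= bit          # keep only the fractional part
    return "".join(bits)


# The first character corresponds to bit 22, the last one to bit 0.
print(fraction_to_bits("0.345"))
```

The first six digits of the result match the "Integer part" column of the first six rows above (0, 1, 0, 1, 1, 0), and the last digit matches the final row.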

Range and Precision

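The fractional part is stored with 23 bits, which together with the implicit leading 1 gives 24 significant binary digits. This allows a precision of roughly 7 significant decimal digits ($2^{23} = 8\,388\,608$). The exponent is stored on 8 bits, which allows normalized numbers with magnitudes from $2^{-126} \approx 1.175 \times 10^{-38}$ up to $(2 - 2^{-23}) \times 2^{127} \approx 3.403 \times 10^{38}$; the power of two $2^{127}$ itself is about $1.701 \times 10^{38}$.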
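These limits can be explored from Python, whose own floats are double precision. The sketch below uses a hypothetical helper, `to_float32`, that rounds a value to the nearest single-precision number with the standard `struct` module. It assumes the usual IEEE 754 layout, in which the largest finite value has all 23 fraction bits set.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


smallest_normal = 2.0 ** -126
largest_finite = (2.0 - 2.0 ** -23) * 2.0 ** 127

# Both limits are exactly representable, so the round trip leaves them unchanged.
print(to_float32(smallest_normal) == smallest_normal)
print(to_float32(largest_finite) == largest_finite)

# One step of the 23-bit fraction near 1.0 is 2^-23; smaller changes are lost.
print(to_float32(1.0 + 2.0 ** -23) == 1.0 + 2.0 ** -23)
print(to_float32(1.0 + 2.0 ** -25) == 1.0)

# A decimal value such as 0.1 is only approximated.
print(to_float32(0.1) == 0.1)
```

The first four comparisons hold and the last one does not, which is the limited precision described above seen from the decimal side.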